Project on Red Wine Quality

This project explores each variable in a red wine dataset to find out which variables have the strongest influence on the quality of red wine.

The data I am going to use is a dataset on red wine.

The data was collected in 2009 by Paulo Cortez, Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis to explore the relationship between the quality of wine and its chemical properties.

I was interested in exploring this dataset because I love drinking wine and wondered which factor has the most impact on the quality of red wine.

Through plots and analysis, I hope to find some of the factors that help explain the quality of red wine.

Univariate Plots Section

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

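As you can see above, the data frame has 1599 observations and 13 columns. The first column, X, is only a row index, which leaves 12 variables: 11 chemical measurements and quality.

I will now look at the distribution of each of the 12 variables.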

Fixed Acidity (tartaric acid - g / dm^3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Looking at the histogram of fixed acidity, the distribution of fixed.acidity is roughly normal with a peak around 7.8. There is a suspected outlier on the right, and I should consider whether to exclude it.

Volatile Acidity (acetic acid - g / dm^3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4932  0.7306  0.8041  0.7977  0.8618  1.1647

Looking at the histogram of volatile.acidity^(1/3), the distribution of the cube-root-transformed variable is roughly normal with a peak around 0.85. There is a suspected outlier on the right, and I should consider whether to exclude it.
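The second summary above is the first one after a cube-root transformation. Because the cube root is monotone, the quantiles carry over (approximately, for interpolated quartiles), while the mean does not. A minimal sketch, using only the five quantile values printed above; the vector name `va` is just for illustration:

```r
# Quantiles of volatile.acidity as printed in the first summary above
va <- c(min = 0.12, q1 = 0.39, median = 0.52, q3 = 0.64, max = 1.58)

# Cube-root transformation, the same one used for the histogram
va_cuberoot <- va^(1/3)

# Compare with the second summary block above
print(round(va_cuberoot, 4))

# The mean does NOT carry over: the cube root of the raw mean (0.5278)
# is about 0.81, not the transformed mean of 0.7977 reported above
print(0.5278^(1/3))
```

On the full data the same summary would come from something like `summary(wine$volatile.acidity^(1/3))`, assuming the data frame is called `wine` as in the model output later in the report.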

Citric Acid (g / dm^3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Looking at the histogram of sqrt(citric.acid), the distribution is roughly normal with peaks around 0.5 and 0.75. There is another peak at 0, because the square root of 0 is 0, so the transformation has no effect on wines with no citric acid.

## TRUE 
##  132

There are 132 wines that have 0 citric acid.
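A count like this comes from a logical comparison. The sketch below applies it to the first ten citric.acid values shown in the structure output at the top of this section, so the result is for those ten rows only, not the full 1599:

```r
# First ten citric.acid values from the str() output above
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# TRUE where a wine has no citric acid at all
is_zero <- citric_acid == 0

# Number of wines with zero citric acid among these ten rows
zero_count <- sum(is_zero)
print(table(is_zero))
```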

Residual Sugar (g / dm^3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Looking at the histogram of log10(residual.sugar), the distribution is roughly normal with peaks around 0.3 and 0.4. There is a suspected outlier on the right, and I should consider whether to exclude it.

Chlorides (sodium chloride - g / dm^3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Looking at the histogram of log10(chlorides), the distribution is roughly normal with peaks around -1.1 and -1.2. There are suspected outliers on both sides, and I should consider whether to exclude them.

Free Sulfur Dioxide (mg / dm^3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Looking at the histogram of sqrt(free.sulfur.dioxide), the distribution is skewed to the right with peaks around 2.5 and 4. There is a suspected outlier on the right, and I should consider whether to exclude it.

Total Sulfur Dioxide (mg / dm^3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Looking at the histogram of log10(total.sulfur.dioxide), the distribution is roughly normal with peaks around 1.5 and 1.75. There is a suspected outlier on the right, and I should consider whether to exclude it.

Total sulfur dioxide is made up of free and bound forms. I will create another variable for bound sulfur dioxide by subtracting free sulfur dioxide from total sulfur dioxide.

Bound Sulfur Dioxide (mg / dm^3)
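The subtraction can be sketched with the first ten rows printed in the structure output above. The result matches the first ten bound.sulfur.dioxide values shown later in the Analysis section:

```r
# First ten values of each variable, taken from the str() output above
free.sulfur.dioxide  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total.sulfur.dioxide <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Bound form = total minus free, element by element
bound.sulfur.dioxide <- total.sulfur.dioxide - free.sulfur.dioxide
print(bound.sulfur.dioxide)
```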

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   12.00   21.00   30.59   39.00  251.50

Looking at the histogram of log10(bound.sulfur.dioxide), the distribution is roughly normal with peaks around 1 and 1.5. There is a suspected outlier on the right, and I should consider whether to exclude it.

Density (g / cm^3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Looking at the histogram of density, the distribution is roughly normal with a peak around 0.997.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Looking at the histogram of pH, the distribution is roughly normal with peaks around 3.25 and 3.3.

Sulphates (potassium sulphate - g / dm^3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Looking at the histogram of log10(sulphates), the distribution is roughly normal, leaning slightly to the right, with a peak around -0.2. There is a suspected outlier on the right, and I should consider whether to exclude it.

Alcohol (% by volume)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Looking at the histogram of alcohol, the distribution is skewed to the right with a peak around 9.5. There is a suspected outlier on the right, and I should consider whether to exclude it.

I will create another variable by dividing alcohol into 5 categories.

Alcohol Level

## 
##  very low       low    medium      high very high 
##       552       639       304        96         8

From the table and plot, we can see that most wines have an alcohol content below 11.
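A categorical variable like this can be built with `cut()`. The report does not show the cut points that were used, so the `breaks` below are placeholders, chosen only so that the first ten rows reproduce the alcohol_lev codes printed in the Analysis section (1 2 2 2 1 1 1 2 1 2). They are an assumption, not the original thresholds:

```r
# First ten alcohol values from the str() output above
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Hypothetical cut points (NOT taken from the report); intervals are
# right-closed, so 9.5 falls in the first bin
breaks <- c(8, 9.5, 11, 12.5, 14, 15)
labels <- c("very low", "low", "medium", "high", "very high")

alcohol_lev <- cut(alcohol, breaks = breaks, labels = labels)
print(table(alcohol_lev))
```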

Quality (score between 0 and 10)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

From the table above, we can see that most wines received a quality score of 5 or 6.

I will create two more variables by dividing quality into three categories (low, medium, high) and into two categories (low, high).

Quality Level (low, medium, high)

## 
##    low medium   high 
##     63   1319    217

From the table, we can see that most red wines fall into the medium category, which is wine with a quality of 5 or 6.

Quality Level (low, high)

## 
##  low high 
##  744  855

From the table, we can see that quality is almost evenly divided between low and high.
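Both groupings can be checked against the quality counts printed earlier (10, 53, 681, 638, 199, 18 for scores 3 to 8). The sketch below rebuilds the quality vector from those counts. The cut points are inferred from the two tables above rather than taken from the original code: 3-4 / 5-6 / 7-8 for three levels, and 3-5 / 6-8 for two levels.

```r
# Rebuild the quality scores from the frequency table printed above
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Three levels: low (3-4), medium (5-6), high (7-8)
quality_3 <- cut(quality, breaks = c(2, 4, 6, 8),
                 labels = c("low", "medium", "high"))

# Two levels: low (3-5), high (6-8)
quality_2 <- cut(quality, breaks = c(2, 5, 8),
                 labels = c("low", "high"))

print(table(quality_3))
print(table(quality_2))
```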

Analysis

There are 1599 red wines in the dataset with 16 variables, including the four that I have created (bound.sulfur.dioxide, alcohol_lev, quality_3, and quality_2).

## 'data.frame':    1599 obs. of  16 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ bound.sulfur.dioxide: num  23 42 39 43 23 27 44 6 9 85 ...
##  $ alcohol_lev         : Factor w/ 5 levels "very low","low",..: 1 2 2 2 1 1 1 2 1 2 ...
##  $ quality_3           : Factor w/ 3 levels "low","medium",..: 2 2 2 2 2 2 2 3 3 2 ...
##  $ quality_2           : Factor w/ 2 levels "low","high": 1 1 1 2 1 1 1 2 2 1 ...

Observation

  • median for fixed acidity is 7.9
  • mean for volatile acidity is 0.5278
  • most red wines had between 0 and 0.5 citric acid
  • median for residual sugar is 2.2
  • median for chlorides is 0.079
  • median for free sulfur dioxide is 14
  • median for total sulfur dioxide is 38
  • median for bound sulfur dioxide is 21
  • mean for density is 0.9967
  • mean for pH is 3.311
  • median for sulphates is 0.62
  • median for alcohol is 10.20

I added the bound sulfur dioxide variable because the bound form might be the one that influences the quality of the wine.

I also had to transform several variables to bring their distributions closer to normal. Most of the histograms were skewed to the right, so I used log10, square root, and cube root transformations.

In the next section, I will explore which variables are best for predicting the quality of red wine.

Bivariate Plots Section

From the above, we can see that citric acid, alcohol, and sulphates have the highest correlation with the quality of red wine.

From the table above, we can see that there might be a multicollinearity problem when we look at multivariate relationships. I will keep this in mind in the next section on multivariate relationships.

From the table above, it is hard to tell whether there is a linear relationship between each variable and the quality of red wine, especially since quality is more of a categorical variable.
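A correlation table of this kind is typically produced with `cor()`. The sketch below uses only the first ten rows printed in the structure output, so the numbers it gives are not representative of the full dataset; it is meant to show the call. Since quality is an ordered score, a rank-based (Spearman) correlation is a reasonable alternative to the default Pearson one; whether the original table used it is not stated.

```r
# First ten rows of a few variables, from the str() output earlier
wine10 <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of every column with quality
cor_pearson <- cor(wine10)[, "quality"]

# Rank-based alternative, which treats quality as ordinal
cor_spearman <- cor(wine10, method = "spearman")[, "quality"]

print(round(cor_pearson, 2))
print(round(cor_spearman, 2))
```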

I will look into this more closely through plots and ANOVA tests.

Quality x Fixed Acidity

From the scatter plot above, it is hard to see the relationship between red wine quality and fixed acidity, especially since wine quality is a categorical variable. I will use box plots to examine the relationship, and to do that I will convert quality from a numeric variable to a factor.

## 'data.frame':    1599 obs. of  16 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ bound.sulfur.dioxide: num  23 42 39 43 23 27 44 6 9 85 ...
##  $ alcohol_lev         : Factor w/ 5 levels "very low","low",..: 1 2 2 2 1 1 1 2 1 2 ...
##  $ quality_3           : Factor w/ 3 levels "low","medium",..: 2 2 2 2 2 2 2 3 3 2 ...
##  $ quality_2           : Factor w/ 2 levels "low","high": 1 1 1 2 1 1 1 2 2 1 ...
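The conversion whose result is printed above can be sketched as follows, again on the first ten rows only. Fixing the levels to 3 through 8 reproduces the factor codes shown in the output (3 3 3 4 3 3 3 5 5 3). The per-group medians are the centre lines a box plot would draw:

```r
# First ten rows from the str() output
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
quality_num   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Convert the numeric score to a factor with one level per observed score
quality <- factor(quality_num, levels = 3:8)
print(as.integer(quality))

# Median fixed acidity within each quality level (NA for empty levels)
group_medians <- tapply(fixed.acidity, quality, median)
print(group_medians)
```

With the factor in place, a base R call such as `boxplot(fixed.acidity ~ quality)` draws one box per quality level.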

From the boxplots above, we don’t see a strong relationship between quality and fixed acidity.

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## quality        5     94  18.737   6.283 8.79e-06 ***
## Residuals   1593   4751   2.982                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can reject the null hypothesis and conclude that there is a significant relationship between quality and fixed acidity. However, the ANOVA table doesn’t tell us which quality groups differ from each other. I will look further into which ones differ through a post hoc test.
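The two steps, a one-way ANOVA followed by Tukey's post hoc comparison, correspond to `aov()` and `TukeyHSD()` in base R. A minimal sketch on the first ten rows from the structure output; with so few rows only three quality levels are present, so the numbers will not match the full-data tables in this report:

```r
# First ten rows from the str() output; quality must be a factor
wine10 <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# One-way ANOVA: does mean fixed acidity differ across quality levels?
fit <- aov(fixed.acidity ~ quality, data = wine10)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Post hoc test: pairwise differences with family-wise 95% confidence
tukey <- TukeyHSD(fit, "quality", conf.level = 0.95)
print(tukey)
```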

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = fixed.acidity ~ quality, data = wine)
## 
## $quality
##            diff         lwr       upr     p adj
## 4-3 -0.58075472 -2.27948640 1.1179770 0.9257629
## 5-3 -0.19274596 -1.76223366 1.3767417 0.9993075
## 6-3 -0.01282132 -1.58307424 1.5574316 1.0000000
## 7-3  0.51236181 -1.08439601 2.1091196 0.9426320
## 8-3  0.20666667 -1.73661257 2.1499459 0.9996570
## 5-4  0.38800876 -0.31462496 1.0906425 0.6148684
## 6-4  0.56793340 -0.13640797 1.2722748 0.1942859
## 7-4  1.09311653  0.33151423 1.8547188 0.0006306
## 8-4  0.78742138 -0.55672768 2.1315705 0.5509949
## 6-5  0.17992465 -0.09155105 0.4514003 0.4080237
## 7-5  0.70510777  0.30806829 1.1021472 0.0000067
## 8-5  0.39941263 -0.77716674 1.5759920 0.9278394
## 7-6  0.52518313  0.12512942 0.9252368 0.0025626
## 8-6  0.21948798 -0.95811196 1.3970879 0.9948930
## 8-7 -0.30569514 -1.51841231 0.9070220 0.9796484

From the comparisons above, we can see that there is a significant difference between 7-4, 7-5 and 7-6.
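Reading the significant pairs off a 15-row table by eye is error-prone, so it can help to filter on the adjusted p-value. The sketch below copies the "p adj" column from the fixed acidity table above and keeps the pairs below 0.05:

```r
# Adjusted p-values copied from the Tukey table for fixed acidity
p_adj <- c(
  "4-3" = 0.9257629, "5-3" = 0.9993075, "6-3" = 1.0000000,
  "7-3" = 0.9426320, "8-3" = 0.9996570, "5-4" = 0.6148684,
  "6-4" = 0.1942859, "7-4" = 0.0006306, "8-4" = 0.5509949,
  "6-5" = 0.4080237, "7-5" = 0.0000067, "8-5" = 0.9278394,
  "7-6" = 0.0025626, "8-6" = 0.9948930, "8-7" = 0.9796484
)

# Pairs of quality levels whose means differ at the 5% level
significant <- names(p_adj)[p_adj < 0.05]
print(significant)
```

Given a fitted `TukeyHSD()` result, the same column is available as `result$quality[, "p adj"]`, where `result` is whatever name the object was stored under.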

Quality x Volatile Acidity

From the boxplots above, we can see some relationship between volatile acidity and quality. It seems that volatile acidity decreases as the quality of wine increases.

##               Df Sum Sq Mean Sq F value Pr(>F)    
## quality        5   8.22   1.645   60.91 <2e-16 ***
## Residuals   1593  43.01   0.027                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can reject the null hypothesis and conclude that there is a significant relationship between quality and volatile acidity. However, the ANOVA table doesn’t tell us which quality groups differ from each other. I will look further into which ones differ through a post hoc test.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = volatile.acidity ~ quality, data = wine)
## 
## $quality
##            diff         lwr         upr     p adj
## 4-3 -0.19053774 -0.35217798 -0.02889749 0.0102247
## 5-3 -0.30745888 -0.45680111 -0.15811665 0.0000001
## 6-3 -0.38701567 -0.53643072 -0.23760063 0.0000000
## 7-3 -0.48058040 -0.63251748 -0.32864332 0.0000000
## 8-3 -0.46116667 -0.64607647 -0.27625687 0.0000000
## 5-4 -0.11692115 -0.18377920 -0.05006310 0.0000099
## 6-4 -0.19647794 -0.26349848 -0.12945740 0.0000000
## 7-4 -0.29004267 -0.36251178 -0.21757355 0.0000000
## 8-4 -0.27062893 -0.39852940 -0.14272846 0.0000000
## 6-5 -0.07955679 -0.10538865 -0.05372493 0.0000000
## 7-5 -0.17312152 -0.21090121 -0.13534183 0.0000000
## 8-5 -0.15370778 -0.26566341 -0.04175215 0.0013080
## 7-6 -0.09356473 -0.13163123 -0.05549822 0.0000000
## 8-6 -0.07415099 -0.18620374  0.03790175 0.4098254
## 8-7  0.01941374 -0.09598053  0.13480800 0.9968509

From the comparisons above, we can see that there is a significant difference between every pair of quality levels except 8-6 and 8-7. Given this, I am considering whether I should group quality into three sections, low (3-4), medium (5-6), and high (7-8), in order to better explain the relationship between volatile acidity and quality.

Quality x Citric Acid

From the boxplots above, we can see some relationship between citric acid and quality. It seems that citric acid increases as the quality of wine increases.

##               Df Sum Sq Mean Sq F value Pr(>F)    
## quality        5   3.53  0.7059   19.69 <2e-16 ***
## Residuals   1593  57.11  0.0359                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can reject the null hypothesis and conclude that there is a significant relationship between quality and citric acid. However, the ANOVA table doesn’t tell us which quality groups differ from each other. I will look further into which ones differ through a post hoc test.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = citric.acid ~ quality, data = wine)
## 
## $quality
##            diff           lwr        upr     p adj
## 4-3 0.003150943 -0.1831058063 0.18940769 1.0000000
## 5-3 0.072685756 -0.0994000884 0.24477160 0.8345084
## 6-3 0.102824451 -0.0693452965 0.27499420 0.5292715
## 7-3 0.204175879  0.0291000127 0.37925175 0.0115446
## 8-3 0.220111111  0.0070410437 0.43318118 0.0381644
## 5-4 0.069534813 -0.0075051774 0.14657480 0.1039655
## 6-4 0.099673508  0.0224462830 0.17690073 0.0032561
## 7-4 0.201024936  0.1175193597 0.28453051 0.0000000
## 8-4 0.216960168  0.0695814861 0.36433885 0.0004036
## 6-5 0.030138695  0.0003728525 0.05990454 0.0451915
## 7-5 0.131490123  0.0879568901 0.17502336 0.0000000
## 8-5 0.147425355  0.0184197856 0.27643092 0.0144221
## 7-6 0.101351428  0.0574877008 0.14521516 0.0000000
## 8-6 0.117286660 -0.0118308104 0.24640413 0.0998116
## 8-7 0.015935232 -0.1170326519 0.14890312 0.9993852

From the comparisons above, we can see that there is a significant difference between 7-3, 8-3, 6-4, 7-4, 8-4, 6-5, 7-5, 8-5, and 7-6.

Quality x Residual Sugar

From the boxplots above, we don’t see a strong relationship between quality and residual sugar.

##               Df Sum Sq Mean Sq F value Pr(>F)
## quality        5     10   2.094   1.053  0.385
## Residuals   1593   3166   1.988

From the ANOVA table, we cannot reject the null hypothesis that there is no significant relationship between quality and residual sugar.
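The F value and p-value in the table can be reconstructed from its other columns: F is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution with 5 and 1593 degrees of freedom. A short sketch using the rounded values printed above (so the results agree with the table only up to rounding):

```r
# Mean squares and degrees of freedom from the residual sugar ANOVA table
ms_quality   <- 2.094
ms_residuals <- 1.988
df_quality   <- 5
df_residuals <- 1593

# F statistic: between-group variance over within-group variance
f_value <- ms_quality / ms_residuals

# p-value: probability of an F at least this large under the null hypothesis
p_value <- pf(f_value, df1 = df_quality, df2 = df_residuals, lower.tail = FALSE)

print(round(f_value, 3))
print(round(p_value, 3))
```

Because the p-value is well above 0.05, the group means are statistically indistinguishable, which is why no post hoc test follows for residual sugar.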

Quality x Chlorides

From the boxplots above, we don’t see a strong relationship between quality and chlorides.

##               Df Sum Sq  Mean Sq F value   Pr(>F)    
## quality        5  0.066 0.013162   6.036 1.53e-05 ***
## Residuals   1593  3.474 0.002181                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can reject the null hypothesis and conclude that there is a significant relationship between quality and chlorides. However, the ANOVA table doesn’t tell us which quality groups differ from each other. I will look further into which ones differ through a post hoc test.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = chlorides ~ quality, data = wine)
## 
## $quality
##             diff         lwr           upr     p adj
## 4-3 -0.031820755 -0.07775835  0.0141168441 0.3563775
## 5-3 -0.029764317 -0.07220686  0.0126782279 0.3421496
## 6-3 -0.037543887 -0.08000713  0.0049193515 0.1180933
## 7-3 -0.045912060 -0.08909205 -0.0027320685 0.0295304
## 8-3 -0.054055556 -0.10660628 -0.0015048304 0.0395900
## 5-4  0.002056438 -0.01694439  0.0210572639 0.9996262
## 6-4 -0.005723132 -0.02477014  0.0133238728 0.9563871
## 7-4 -0.014091306 -0.03468678  0.0065041663 0.3707314
## 8-4 -0.022234801 -0.05858367  0.0141140711 0.5018527
## 6-5 -0.007779570 -0.01512090 -0.0004382449 0.0304543
## 7-5 -0.016147743 -0.02688460 -0.0054108855 0.0002720
## 8-5 -0.024291238 -0.05610864  0.0075261647 0.2484623
## 7-6 -0.008368173 -0.01918654  0.0024501961 0.2349638
## 8-6 -0.016511668 -0.04835667  0.0153333334 0.6775878
## 8-7 -0.008143495 -0.04093815  0.0246511567 0.9809645

From the comparisons above, we can see that there is a significant difference between 7-3, 8-3, 6-5, and 7-5.

Quality x Free Sulfur Dioxide

From the boxplots above, we don’t see a strong relationship between quality and free sulfur dioxide.

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## quality        5   2571   514.1   4.754 0.000257 ***
## Residuals   1593 172274   108.1                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can reject the null hypothesis and conclude that there is a significant relationship between quality and free sulfur dioxide. However, the ANOVA table doesn’t tell us which quality groups differ from each other. I will look further into which ones differ through a post hoc test.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = free.sulfur.dioxide ~ quality, data = wine)
## 
## $quality
##           diff         lwr        upr     p adj
## 4-3  1.2641509  -8.9655860 11.4938879 0.9992862
## 5-3  5.9838473  -3.4675842 15.4352788 0.4618281
## 6-3  4.7115987  -4.7444410 14.1676385 0.7138656
## 7-3  3.0452261  -6.5704257 12.6608780 0.9456583
## 8-3  2.2777778  -9.4246209 13.9801764 0.9937451
## 5-4  4.7196963   0.4884466  8.9509461 0.0185784
## 6-4  3.4474478  -0.7940854  7.6889810 0.1868980
## 7-4  1.7810752  -2.8052825  6.3674329 0.8782125
## 8-4  1.0136268  -7.0808188  9.1080725 0.9992387
## 6-5 -1.2722485  -2.9070711  0.3625740 0.2288173
## 7-5 -2.9386212  -5.3295870 -0.5476553 0.0061996
## 8-5 -3.7060695 -10.7914129  3.3792739 0.6692481
## 7-6 -1.6663726  -4.0754901  0.7427448 0.3580539
## 8-6 -2.4338210  -9.5253103  4.6576683 0.9246011
## 8-7 -0.7674484  -8.0704130  6.5355163 0.9996765

From the comparisons above, we can see that there is a significant difference between 5-4 and 7-5.

Quality x Bound Sulfur Dioxide

From the boxplots above, we don’t see a strong relationship between quality and bound sulfur dioxide.

##               Df  Sum Sq Mean Sq F value Pr(>F)    
## quality        5   98706   19741   29.36 <2e-16 ***
## Residuals   1593 1071097     672                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can reject the null hypothesis and conclude that there is a significant relationship between quality and bound sulfur dioxide. However, the ANOVA table doesn’t tell us which quality groups differ from each other. I will look further into which ones differ through a post hoc test.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = bound.sulfur.dioxide ~ quality, data = wine)
## 
## $quality
##            diff        lwr        upr     p adj
## 4-3  10.0811321 -15.426418  35.588682 0.8699705
## 5-3  25.6301028   2.063234  49.196971 0.0238790
## 6-3  11.2583072 -12.320052  34.836666 0.7496028
## 7-3   7.0748744 -16.901472  31.051221 0.9596104
## 8-3   6.2666667 -22.912922  35.446256 0.9901376
## 5-4  15.5489707   4.998473  26.099468 0.0003955
## 6-4   1.1771751  -9.398964  11.753314 0.9995713
## 7-4  -3.0062577 -14.442207   8.429691 0.9754993
## 8-4  -3.8144654 -23.997729  16.368798 0.9945496
## 6-5 -14.3717956 -18.448178 -10.295413 0.0000000
## 7-5 -18.5552284 -24.517032 -12.593425 0.0000000
## 8-5 -19.3634361 -37.030533  -1.696339 0.0221440
## 7-6  -4.1834328 -10.190497   1.823631 0.3501748
## 8-6  -4.9916405 -22.674062  12.690781 0.9665845
## 8-7  -0.8082077 -19.017937  17.401521 0.9999955

From the comparisons above, we can see that there is a significant difference between 5-3, 5-4, 6-5, 7-5 and 8-5.

Quality x Total Sulfur Dioxide

From the boxplots above, we don’t see a strong relationship between quality and total sulfur dioxide.

##               Df  Sum Sq Mean Sq F value Pr(>F)    
## quality        5  128045   25609   25.48 <2e-16 ***
## Residuals   1593 1601155    1005                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can reject the null hypothesis and conclude that there is a significant relationship between quality and total sulfur dioxide. However, the ANOVA table doesn’t tell us which quality groups differ from each other. I will look further into which ones differ through a post hoc test.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = total.sulfur.dioxide ~ quality, data = wine)
## 
## $quality
##           diff        lwr        upr     p adj
## 4-3  11.345283 -19.841526  42.532092 0.9051128
## 5-3  31.613950   2.799916  60.427984 0.0219162
## 6-3  15.969906 -12.858177  44.797989 0.6115609
## 7-3  10.120101 -19.194583  39.434784 0.9228115
## 8-3   8.544444 -27.131983  44.220872 0.9838108
## 5-4  20.268667   7.369100  33.168234 0.0001149
## 6-4   4.624623  -8.306295  17.555541 0.9112220
## 7-4  -1.225183 -15.207347  12.756982 0.9998676
## 8-4  -2.800839 -27.477908  21.876231 0.9995284
## 6-5 -15.644044 -20.628033 -10.660055 0.0000000
## 7-5 -21.493850 -28.783049 -14.204650 0.0000000
## 8-5 -23.069506 -44.670183  -1.468828 0.0283464
## 7-6  -5.849805 -13.194343   1.494732 0.2059503
## 8-6  -7.425462 -29.044876  14.193953 0.9243726
## 8-7  -1.575656 -23.839783  20.688471 0.9999539

From the comparisons above, we can see that there is a significant difference between 5-3, 5-4, 6-5, 7-5 and 8-5.

Quality x Density

From the boxplots above, we don’t see a strong relationship between quality and density.

##               Df   Sum Sq   Mean Sq F value   Pr(>F)    
## quality        5 0.000230 4.594e-05    13.4 8.12e-13 ***
## Residuals   1593 0.005462 3.430e-06                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can reject the null hypothesis and conclude that there is a significant relationship between quality and density. However, the ANOVA table doesn’t tell us which quality groups differ from each other. I will look further into which ones differ through a post hoc test.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = density ~ quality, data = wine)
## 
## $quality
##              diff           lwr           upr     p adj
## 4-3 -9.215472e-04 -0.0027431246  9.000302e-04 0.7003175
## 5-3 -3.603730e-04 -0.0020433600  1.322614e-03 0.9902708
## 6-3 -8.489373e-04 -0.0025327449  8.348703e-04 0.7033996
## 7-3 -1.359729e-03 -0.0030719578  3.525005e-04 0.2088099
## 8-3 -2.251778e-03 -0.0043355875 -1.679681e-04 0.0253891
## 5-4  5.611742e-04 -0.0001922713  1.314620e-03 0.2747470
## 6-4  7.260987e-05 -0.0006826668  8.278865e-04 0.9997910
## 7-4 -4.381815e-04 -0.0012548599  3.784970e-04 0.6443084
## 8-4 -1.330231e-03 -0.0027715834  1.111221e-04 0.0899646
## 6-5 -4.885643e-04 -0.0007796721 -1.974566e-04 0.0000271
## 7-5 -9.993557e-04 -0.0014251075 -5.736038e-04 0.0000000
## 8-5 -1.891405e-03 -0.0031530698 -6.297397e-04 0.0002889
## 7-6 -5.107913e-04 -0.0009397754 -8.180729e-05 0.0090996
## 8-6 -1.402840e-03 -0.0026655999 -1.400810e-04 0.0193569
## 8-7 -8.920491e-04 -0.0021924653  4.083671e-04 0.3677080

From the comparisons above, we can see that there is a significant difference between 8-3, 6-5, 7-5, 8-5, 7-6 and 8-6.

Quality x pH

From the boxplots above, it seems that pH decreases as quality increases. However, we should not conclude anything yet.

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## quality        5   0.51 0.10242   4.342 0.000628 ***
## Residuals   1593  37.58 0.02359                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can reject the null hypothesis and conclude that there is a significant relationship between quality and pH. However, the ANOVA table doesn’t tell us which quality groups differ from each other. I will look further into which ones differ through a post hoc test.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = pH ~ quality, data = wine)
## 
## $quality
##            diff         lwr           upr     p adj
## 4-3 -0.01649057 -0.16757254  0.1345914093 0.9996104
## 5-3 -0.09305140 -0.23263865  0.0465358649 0.4012091
## 6-3 -0.07992790 -0.21958322  0.0597274183 0.5767071
## 7-3 -0.10724623 -0.24925884  0.0347663821 0.2599190
## 8-3 -0.13077778 -0.30360935  0.0420537932 0.2578317
## 5-4 -0.07656083 -0.13905174 -0.0140699183 0.0064502
## 6-4 -0.06343733 -0.12608012 -0.0007945477 0.0451336
## 7-4 -0.09075567 -0.15849113 -0.0230202009 0.0019007
## 8-4 -0.11428721 -0.23383328  0.0052588577 0.0704301
## 6-5  0.01312350 -0.01102104  0.0372680287 0.6312170
## 7-5 -0.01419484 -0.04950677  0.0211171021 0.8615725
## 8-5 -0.03772638 -0.14236912  0.0669163551 0.9083845
## 7-6 -0.02731833 -0.06289835  0.0082616867 0.2425756
## 8-6 -0.05084988 -0.15558338  0.0538836280 0.7359924
## 8-7 -0.02353155 -0.13138831  0.0843252185 0.9893982

From the comparisons above, we can see that there is a significant difference between 5-4, 6-4, and 7-4.

Quality x Sulphates

From the boxplots above, it seems that the sulphates level increases as quality increases. However, we should not conclude anything yet.

##               Df Sum Sq Mean Sq F value Pr(>F)    
## quality        5   3.00  0.6000   22.27 <2e-16 ***
## Residuals   1593  42.91  0.0269                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can reject the null hypothesis and conclude that there is a significant relationship between quality and sulphates. However, the ANOVA table doesn’t tell us which quality groups differ from each other. I will look further into which ones differ through a post hoc test.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = sulphates ~ quality, data = wine)
## 
## $quality
##           diff         lwr        upr     p adj
## 4-3 0.02641509 -0.13504180 0.18787198 0.9972425
## 5-3 0.05096916 -0.09820366 0.20014199 0.9259342
## 6-3 0.10532915 -0.04391640 0.25457471 0.3348774
## 7-3 0.17125628  0.01949155 0.32302101 0.0164864
## 8-3 0.19777778  0.01307773 0.38247782 0.0276634
## 5-4 0.02455407 -0.04222814 0.09133628 0.9011170
## 6-4 0.07891406  0.01196955 0.14585857 0.0102225
## 7-4 0.14484119  0.07245428 0.21722810 0.0000002
## 8-4 0.17136268  0.04360729 0.29911807 0.0018695
## 6-5 0.05435999  0.02855743 0.08016255 0.0000000
## 7-5 0.12028712  0.08255028 0.15802395 0.0000000
## 8-5 0.14680861  0.03497998 0.25863725 0.0025621
## 7-6 0.06592713  0.02790380 0.10395045 0.0000123
## 8-6 0.09244862 -0.01947701 0.20437426 0.1723998
## 8-7 0.02652150 -0.08874188 0.14178487 0.9864895

From the table above, we can see a significant difference (adjusted p-value below 0.05) for the pairs 7-3, 8-3, 6-4, 7-4, 8-4, 6-5, 7-5, 8-5, and 7-6.

Quality x Alcohol

From the boxplots above, the alcohol level seems to increase as quality increases. However, we should not conclude anything yet.

##               Df Sum Sq Mean Sq F value Pr(>F)    
## quality        5  483.9   96.79   115.9 <2e-16 ***
## Residuals   1593 1330.8    0.84                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can reject the null hypothesis and conclude that the mean alcohol level differs significantly across quality levels. However, the ANOVA table doesn’t tell us which quality groups differ from each other, so I will look further into which ones differ with a post hoc test.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = alcohol ~ quality, data = wine)
## 
## $quality
##            diff          lwr         upr     p adj
## 4-3  0.31009434 -0.589020145 1.209208824 0.9231095
## 5-3 -0.05529369 -0.886001167 0.775413796 0.9999660
## 6-3  0.67451933 -0.156593176 1.505631838 0.1882542
## 7-3  1.51091290  0.665771726 2.356054069 0.0000056
## 8-3  2.13944444  1.110894424 3.167994465 0.0000001
## 5-4 -0.36538803 -0.737282044 0.006505993 0.0574326
## 6-4  0.36442499 -0.008372862 0.737222846 0.0597032
## 7-4  1.20081856  0.797713311 1.603923806 0.0000000
## 8-4  1.82935010  1.117911150 2.540789059 0.0000000
## 6-5  0.72981302  0.586124800 0.873501234 0.0000000
## 7-5  1.56620658  1.356059244 1.776353923 0.0000000
## 8-5  2.19473813  1.571991432 2.817484828 0.0000000
## 7-6  0.83639357  0.624650838 1.048136295 0.0000000
## 8-6  1.46492511  0.841638238 2.088211988 0.0000000
## 8-7  0.62853155 -0.013342374 1.270405467 0.0589299

From the table above, we can see a significant difference (adjusted p-value below 0.05) for the pairs 7-3, 8-3, 7-4, 8-4, 6-5, 7-5, 8-5, 7-6, and 8-6.

Bivariate Analysis

F-value
- Quality x Fixed Acidity: 6.283
- Quality x Volatile Acidity: 60.91
- Quality x Citric Acid: 19.69
- Quality x Residual Sugar: 1.053
- Quality x Chlorides: 6.036
- Quality x Free Sulfur Dioxide: 4.754
- Quality x Bound Sulfur Dioxide: 29.36
- Quality x Total Sulfur Dioxide: 25.48
- Quality x Density: 13.4
- Quality x pH: 4.342
- Quality x Sulphates: 22.27
- Quality x Alcohol: 115.9

Only residual sugar has an F-value too low to reject the null hypothesis. Alcohol, volatile acidity, bound sulfur dioxide, sulphates, and citric acid had high F-values, so I will use these variables to further investigate the relationship.
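As a worked check on how these F-values are read, the sketch below recomputes the sulphates F-value from the sums of squares and degrees of freedom in its ANOVA table, then compares the residual sugar F-value against the 5% critical value for 5 and 1593 degrees of freedom. The variable names are my own, and the 0.05 threshold is an assumed significance level.

```r
# Sulphates ANOVA table: Sum Sq and Df for quality and residuals.
ms_between <- 3.00 / 5       # mean square between groups (0.6000)
ms_within  <- 42.91 / 1593   # mean square within groups (about 0.0269)
f_sulphates <- ms_between / ms_within
print(round(f_sulphates, 2))

# Critical F value at the (assumed) 5% level for df = 5 and 1593.
f_crit <- qf(0.95, df1 = 5, df2 = 1593)
print(f_crit)

# Residual sugar: F = 1.053 falls below the critical value,
# so the null hypothesis of equal group means is not rejected.
f_sugar <- 1.053
p_sugar <- pf(f_sugar, df1 = 5, df2 = 1593, lower.tail = FALSE)
print(p_sugar)
```

Every other F-value in the list is above the critical value, which is why residual sugar is the only variable dropped at this step.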

From observing the bivariate plots, I noticed that it would be better to reorganize quality into three categories: low, medium, and high.

Multivariate Plots Section

Quality x Alcohol x Volatile Acidity

From the graph, high quality red wines tend to have a higher alcohol level than other wines. However, other than the alcohol level, it is hard to notice any strong relationship. Instead of dividing quality into low, medium, and high, I think a variable that divides quality into low (3 to 5) and high (6 to 8) might be better at explaining the relationship.
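The two-level split described above can be sketched with `cut`. The vector `quality` below holds a few example scores rather than the full data set, and the name `quality_group` is hypothetical.

```r
# Example quality scores (illustration only, not the full data set).
quality <- c(3, 4, 5, 6, 7, 8, 5, 6)

# Intervals are right-closed by default: (2, 5] is "low", (5, 8] is "high".
quality_group <- cut(quality, breaks = c(2, 5, 8), labels = c("low", "high"))

print(table(quality_group))
```

Using `cut` returns a factor, so the new variable can be passed directly to a plot's color or facet aesthetic.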

From the graph, we can see that high quality wines tend to have a higher alcohol level and lower volatile acidity than low quality wines. However, given the high variance, we should be careful about concluding any relationship. I will further explore with other variables.

Quality x Alcohol x Bound Sulfur Dioxide

From the graph, it is hard to notice any relationship. The only thing I could notice is that high quality wines tend to have a higher alcohol level than low quality wines. I will further explore with other variables.

Quality x Alcohol x Sulphates

From the graph, it is hard to notice any relationship. The only thing I could notice is that wines with a low alcohol level are more dispersed in sulphates level than wines with a high alcohol level. I will further explore with other variables.

Quality x Alcohol x Citric Acid

From the graph, it seems that low quality wines tend to have not only a lower alcohol level but also a lower citric acid level. I will further explore with other variables.

Quality x Volatile Acidity x Bound Sulfur Dioxide

From the graph, high quality wines tend to have low volatile acidity and high bound sulfur dioxide compared to low quality wines. I will further explore with other variables.

Quality x Volatile Acidity x Sulphates

From the graph, it is hard to notice any strong relationship, but high quality wine tends to have higher sulphates and lower volatile acidity than low quality wine. I will also add the alcohol level variable to the graph to observe the relationship.

From the plot above, we can see that high quality wine tends to have a higher alcohol level, a higher sulphates level, and lower volatile acidity than low quality wine.

I will further explore with other variables.

Quality x Volatile Acidity x Citric Acid

From the graph, it is hard to notice any strong relationship, but high quality wine tends to have higher citric acid and lower volatile acidity than low quality wine. I will further explore with other variables.

Quality x Bound Sulfur Dioxide x Sulphates

From the graph, it is hard to notice any strong relationship, but high quality wine tends to have higher sulphates than low quality wine. I will further explore with other variables.

Quality x Bound Sulfur Dioxide x Citric Acid

From the graph, it is hard to notice any strong relationship, but high quality wine tends to have higher citric acid than low quality wine. I will further explore with other variables.

Quality x Sulphates x Citric Acid

From the graph, it is hard to notice any strong relationship, but high quality wine tends to have higher sulphates than low quality wine.

Multivariate Analysis

For most of the plots, it was hard to identify a strong relationship among variables. As shown in the bivariate analysis, alcohol seemed to have more influence on the quality of wine than the other variables.

Final Plots and Summary

Plot One

Description One

The plot indicates that high quality wines tend to have a higher alcohol level than low quality wines.

Plot Two

Description Two

Even though it is not evident, we can see that high quality wine tends to have higher sulphates and lower volatile acidity than low quality wine. However, because of the high variance, it would be hard to predict a wine's quality from its sulphates and volatile acidity alone.

Plot Three

Description Three

From the plot, we can see that high quality wine tends to have a higher alcohol level, higher sulphates, and lower volatile acidity than low quality wine. As mentioned before, the relationship is not strong enough to predict the quality of red wine from its alcohol level, sulphates, and volatile acidity.

Reflection

The red wine data set contained 1,599 red wines with 12 variables. I explored each variable's distribution and bivariate models to identify relationships between the variables and the quality of red wine. I used boxplots to explore the relationship between quality and the other variables. The difficulty with boxplots was that there was no exact standard for concluding whether the relationship between quality and a variable was strong enough. In particular, with a small number of observations, the variance was too large to identify the relationship. Despite these limitations, I still found alcohol, sulphates, and volatile acidity to have the most influence on the quality of red wine.

From this project, I realized that the quality of red wine, which is decided by wine experts, is too complex to be explained by the 12 given variables. Next time, I hope there will be more variables, such as the price of the wine, and more data to explore. With the price and sales records of each wine, we could analyze the difference in preference between wine experts and other drinkers. Furthermore, if we had information on who buys which wine, we could see which age group tends to like wine with a high alcohol level.

I strongly feel that with more variables and data, there is much more we could explore.